Data Science Hackathon¶

Problem Statement (Marketing for Financial Services)¶

  • DB Bank is a large public sector bank whuch has branches across all the cities. It provides various services like savings account, current account, term desposits, personal loans, home loans etc. to customers.
  • Recently, the bank has conducted a campaign to market their term-deposit scheme. Campaigns were conducted based mostly on direct phone calls, soliciting the bank's customers to place a term deposit. After all the marketing efforts, if the client has agreed to place a deposit, then the campaign is success, otherwise not.
  • As a data analyst, we are required to analyse the data which is available at hand, perform some exploratory and descriptive data analysis to identify useful patterns, trends, and insights which will help th marketing team in targeting right customers. Also, we are required to build a predictive model on the given data which can classify the clients on whether they will agree to place a deposit or not correctly.

Step 1: Data Understanding, Cleaning and Missing Value Treatment¶

Importing Necessary Libraries¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from IPython.display import display
pd.options.plotting.backend = 'plotly'
import warnings
warnings.filterwarnings('ignore')

Data Reading and Descriptive Analysis¶

In [2]:
city_master_df = pd.read_csv('Data/City_Master.csv')
state_master_df = pd.read_csv('Dat/State_Master.csv')
region_code_master_df = pd.read_csv('Dat/Region_code_master.csv')
customer_bank_df = pd.read_csv('Dat/Customer_and_bank details_p1.csv')
customer_campaign_df = pd.read_csv('Dat/Customer_campaign_details_p1.csv')
customer_address_df = pd.read_csv('Dat/Customer_Postal_Code_details.csv')
customer_demographics_df = pd.read_csv('Dat/Customer_social_economic_data_p1.csv')
customer_response_df = pd.read_csv('Dat/Customer_Response_data_p1.csv')
In [3]:
display(city_master_df.head())
display(state_master_df.head())
display(region_code_master_df.head())
display(customer_bank_df.head())
display(customer_campaign_df.head())
display(customer_address_df.head())
display(customer_demographics_df.head())
display(customer_response_df.head())
City_Code City_Name State_Code
0 C1 Henderson S1
1 C2 Los Angeles S2
2 C3 Fort Lauderdale S3
3 C4 Concord S4
4 C5 Seattle S5
State_Code State_Name Region_Code
0 S1 Kentucky 3
1 S2 California 4
2 S3 Florida 3
3 S4 North Carolina 3
4 S5 Washington 4
Region_Name Region_Code
0 Central 1
1 East 2
2 South 3
3 West 4
4 North 5
Customer_id age job marital education default housing loan Region_Code State_Code City_Code
0 1 56 services married high.school no no yes 3 S1 C1
1 2 45 services married basic.9y unknown no no 3 S1 C1
2 3 59 admin. married professional.course no no no 4 S2 C2
3 4 41 blue-collar married unknown unknown no no 3 S3 C3
4 5 24 technician single professional.course no yes no 3 S3 C3
Customer_id contact month day_of_week duration campaign pdays previous poutcome
0 1 telephone may mon 307 1 999 0 nonexistent
1 2 telephone may mon 198 1 999 0 nonexistent
2 3 telephone may mon 139 1 999 0 nonexistent
3 4 telephone may mon 217 1 999 0 nonexistent
4 5 telephone may mon 380 1 999 0 nonexistent
customer_id Postal Code
0 1 42420
1 2 42420
2 3 90036
3 4 33311
4 5 33311
Customer_id emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
0 1 1.1 93.994 -36.4 4.857 5191.0
1 2 1.1 93.994 -36.4 4.857 5191.0
2 3 1.1 93.994 -36.4 4.857 5191.0
3 4 1.1 93.994 -36.4 4.857 5191.0
4 5 1.1 93.994 -36.4 4.857 5191.0
Customer_id y
0 1 no
1 2 no
2 3 no
3 4 no
4 5 no

Action Insights
After heaving a look at the basic structure of all the datasets we have, let us try to merge the customer_bank_df, customer_campaign_df, customer_address_df, customer_demographics_df and customer_response_df based on the Customer_id column and then deep dive into its descriptive statistics

In [4]:
df = pd.merge(left = customer_bank_df, right = customer_campaign_df, on = 'Customer_id', how = 'left')
df = pd.merge(left = df, right = customer_demographics_df, on = 'Customer_id', how = 'left')
customer_address_df.rename(columns = {'customer_id':'Customer_id'}, inplace=True)
df = pd.merge(df, customer_address_df, on = 'Customer_id', how='left')
df = pd.merge(left = df, right = customer_response_df, on = 'Customer_id', how = 'left')
In [5]:
df.head()
Out[5]:
Customer_id age job marital education default housing loan Region_Code State_Code ... pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed Postal Code y
0 1 56 services married high.school no no yes 3 S1 ... 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 42420 no
1 2 45 services married basic.9y unknown no no 3 S1 ... 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 42420 no
2 3 59 admin. married professional.course no no no 4 S2 ... 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 90036 no
3 4 41 blue-collar married unknown unknown no no 3 S3 ... 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 33311 no
4 5 24 technician single professional.course no yes no 3 S3 ... 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 33311 no

5 rows × 26 columns

In [6]:
df.shape
Out[6]:
(37084, 26)
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 37084 entries, 0 to 37083
Data columns (total 26 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Customer_id     37084 non-null  int64  
 1   age             37084 non-null  int64  
 2   job             37084 non-null  object 
 3   marital         37084 non-null  object 
 4   education       37084 non-null  object 
 5   default         37084 non-null  object 
 6   housing         37084 non-null  object 
 7   loan            37084 non-null  object 
 8   Region_Code     37084 non-null  object 
 9   State_Code      37084 non-null  object 
 10  City_Code       37084 non-null  object 
 11  contact         37084 non-null  object 
 12  month           37084 non-null  object 
 13  day_of_week     37084 non-null  object 
 14  duration        37084 non-null  int64  
 15  campaign        37084 non-null  int64  
 16  pdays           37084 non-null  int64  
 17  previous        37084 non-null  int64  
 18  poutcome        37084 non-null  object 
 19  emp.var.rate    37084 non-null  float64
 20  cons.price.idx  37084 non-null  float64
 21  cons.conf.idx   37084 non-null  float64
 22  euribor3m       37084 non-null  float64
 23  nr.employed     37084 non-null  float64
 24  Postal Code     37084 non-null  int64  
 25  y               37084 non-null  object 
dtypes: float64(5), int64(7), object(14)
memory usage: 7.6+ MB
In [8]:
df.duplicated().sum()
Out[8]:
0

Observations

  1. We have 37084 row and 26 colums in our dataset.
  2. All the columns do not seem to have any nan values in them. However, any abnormalities in the values in those columns have to be analysed as there might ne missing values in the format other than np.nan
  3. The data type of the Postal Code column should be object, because different postal codes are indicative of different locations and hence are categorical. The data types of all the other colums look correct and don't need to be changed.
  4. We have a total of 26 columns out of which 14 are categorical and 12 are numeric. Within numeric columns, 5 columns are of discrete type and the remaining 7 are of continuous type.
  5. Customer ID is a unique identifier for each customer, hence it should be removed in the final model building process.
  6. There are no duplicate entries in the dataset.

Converting Postal Code from Int64 to Object Type

In [9]:
df['Postal Code'] = df['Postal Code'].astype('object')

Descriptive Statistics of Numeric and Object Types

In [10]:
df.describe(include = ['int64','float64'])
Out[10]:
Customer_id age duration campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
count 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000
mean 18542.500000 40.042714 258.237946 2.569545 962.530849 0.172986 0.082669 93.576076 -40.505183 3.621668 5167.058664
std 10705.373028 10.432965 258.730909 2.770611 186.773063 0.495681 1.568997 0.578493 4.622045 1.733972 72.196605
min 1.000000 17.000000 0.000000 1.000000 0.000000 0.000000 -3.400000 92.201000 -50.800000 0.634000 4963.600000
25% 9271.750000 32.000000 102.000000 1.000000 999.000000 0.000000 -1.800000 93.075000 -42.700000 1.344000 5099.100000
50% 18542.500000 38.000000 180.000000 2.000000 999.000000 0.000000 1.100000 93.749000 -41.800000 4.857000 5191.000000
75% 27813.250000 47.000000 319.250000 3.000000 999.000000 0.000000 1.400000 93.994000 -36.400000 4.961000 5228.100000
max 37084.000000 98.000000 4918.000000 56.000000 999.000000 7.000000 1.400000 94.767000 -26.900000 5.045000 5228.100000
In [11]:
df.describe(include = ['object'])
Out[11]:
job marital education default housing loan Region_Code State_Code City_Code contact month day_of_week poutcome Postal Code y
count 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084
unique 12 4 8 3 3 3 5 49 531 2 10 5 3 631 2
top admin. married university.degree no yes no 4 S2 C21 cellular may thu nonexistent 10035 no
freq 9420 22479 10971 29382 19433 30561 11884 7427 3422 23522 12420 7778 32023 974 32876

Observations

  1. The maximum age of the customer in the dataset is 98 while the 75th percentile is 47. So this might be an abnormal value which needs to be looked at ahead.
  2. The maximum duration seems to be 4918 while the 75th percentile is only 319.25. So this again indicates an outlier which needs to be looked at and treated accordingly.
  3. pdays column has most of the data as 999 which indicates that the client wasn't contacted at all. A better alternative of replacing such high indicators could be -1 as it would keep the distribution of the original data intact.
  4. Categorical columns need to be explored for any abnormal categories present in them.
  5. It is a good idea to change the case of all the columns to lowercase for the sake of uniformity & replace any periods with underscore.

Renaming the columns for uniformity¶

In [12]:
df.rename(columns = {'Customer_id':'customer_id',
          'Region_Code':'region_code',
          'State_Code':'state_code',
          'City_Code':'city_code',
          'Postal Code':'postal_code',
          'emp.var.rate':'emp_var_rate',
          'cons.price.idx':'cons_price_idx',
          'cons.conf.idx':'cons_conf_idx',           
          'nr.employed':'nr_employed'}, inplace=True)
In [13]:
df.columns
Out[13]:
Index(['customer_id', 'age', 'job', 'marital', 'education', 'default',
       'housing', 'loan', 'region_code', 'state_code', 'city_code', 'contact',
       'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous',
       'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx',
       'euribor3m', 'nr_employed', 'postal_code', 'y'],
      dtype='object')

Replacing 999 in days with -1

In [14]:
df['pdays'] = np.where(df['pdays']==999,-1,df['pdays'])
In [15]:
df['pdays'].unique()
Out[15]:
array([-1,  6,  4,  3,  5,  1,  0, 10,  7,  8,  9, 11,  2, 12, 13, 14, 15,
       16, 21, 17, 18, 22, 25, 26, 19, 27, 20], dtype=int64)

Analysing categorical columns¶

In [16]:
cat_cols = df.select_dtypes(['object']).columns
In [17]:
for col in cat_cols:
    print("Column name:",col)
    print("Unique categories:",df[col].unique())
    print("Number of unique categories:",df[col].nunique())
    print()
Column name: job
Unique categories: ['services' 'admin.' 'blue-collar' 'technician' 'housemaid' 'retired'
 'management' 'unknown' 'entrepreneur' 'unemployed' 'student'
 'self-employed']
Number of unique categories: 12

Column name: marital
Unique categories: ['married' 'single' 'divorced' 'unknown']
Number of unique categories: 4

Column name: education
Unique categories: ['high.school' 'basic.9y' 'professional.course' 'unknown' 'basic.4y'
 'basic.6y' 'university.degree' 'illiterate']
Number of unique categories: 8

Column name: default
Unique categories: ['no' 'unknown' 'yes']
Number of unique categories: 3

Column name: housing
Unique categories: ['no' 'yes' 'unknown']
Number of unique categories: 3

Column name: loan
Unique categories: ['yes' 'no' 'unknown']
Number of unique categories: 3

Column name: region_code
Unique categories: ['3' '4' 'Na' '1' '2']
Number of unique categories: 5

Column name: state_code
Unique categories: ['S1' 'S2' 'S3' 'S4' 'S5' 'S6' 'S7' 'S8' 'S9' 'S10' 'S11' 'S12' 'S13'
 'S14' 'S15' 'S16' 'S17' 'S18' 'S19' 'S20' 'S21' 'S22' 'S23' 'S24' 'S25'
 'S26' 'S27' 'S28' 'S29' 'S30' 'S31' 'S32' 'S33' 'S34' 'S35' 'S36' 'S37'
 'S38' 'S39' 'S40' 'S41' 'S42' 'S43' 'S44' 'S45' 'S46' 'S47' 'S48' 'S49']
Number of unique categories: 49

Column name: city_code
Unique categories: ['C1' 'C2' 'C3' 'C4' 'C5' 'C6' 'C7' 'C8' 'C9' 'C10' 'C11' 'C12' 'C13'
 'C14' 'C15' 'C16' 'C17' 'C18' 'C19' 'C20' 'C21' 'C22' 'C23' 'C24' 'C25'
 'C26' 'C27' 'C28' 'C29' 'C30' 'C31' 'C32' 'C33' 'C34' 'C35' 'C36' 'C37'
 'C38' 'C39' 'C40' 'C41' 'C42' 'C43' 'C44' 'C45' 'C46' 'C47' 'C48' 'C49'
 'C50' 'C51' 'C52' 'C53' 'C54' 'C55' 'C56' 'C57' 'C58' 'C59' 'C60' 'C61'
 'C62' 'C63' 'C64' 'C65' 'C66' 'C67' 'C68' 'C69' 'C70' 'C71' 'C72' 'C73'
 'C74' 'C75' 'C76' 'C77' 'C78' 'C79' 'C80' 'C81' 'C82' 'C83' 'C84' 'C85'
 'C86' 'C87' 'C88' 'C89' 'C90' 'C91' 'C92' 'C93' 'C94' 'C95' 'C96' 'C97'
 'C98' 'C99' 'C100' 'C101' 'C102' 'C103' 'C104' 'C105' 'C106' 'C107'
 'C108' 'C109' 'C110' 'C111' 'C112' 'C113' 'C114' 'C115' 'C116' 'C117'
 'C118' 'C119' 'C120' 'C121' 'C122' 'C123' 'C124' 'C125' 'C126' 'C127'
 'C128' 'C129' 'C130' 'C131' 'C132' 'C133' 'C134' 'C135' 'C136' 'C137'
 'C138' 'C139' 'C140' 'C141' 'C142' 'C143' 'C144' 'C145' 'C146' 'C147'
 'C148' 'C149' 'C150' 'C151' 'C152' 'C153' 'C154' 'C155' 'C156' 'C157'
 'C158' 'C159' 'C160' 'C161' 'C162' 'C163' 'C164' 'C165' 'C166' 'C167'
 'C168' 'C169' 'C170' 'C171' 'C172' 'C173' 'C174' 'C175' 'C176' 'C177'
 'C178' 'C179' 'C180' 'C181' 'C182' 'C183' 'C184' 'C185' 'C186' 'C187'
 'C188' 'C189' 'C190' 'C191' 'C192' 'C193' 'C194' 'C195' 'C196' 'C197'
 'C198' 'C199' 'C200' 'C201' 'C202' 'C203' 'C204' 'C205' 'C206' 'C207'
 'C208' 'C209' 'C210' 'C211' 'C212' 'C213' 'C214' 'C215' 'C216' 'C217'
 'C218' 'C219' 'C220' 'C221' 'C222' 'C223' 'C224' 'C225' 'C226' 'C227'
 'C228' 'C229' 'C230' 'C231' 'C232' 'C233' 'C234' 'C235' 'C236' 'C237'
 'C238' 'C239' 'C240' 'C241' 'C242' 'C243' 'C244' 'C245' 'C246' 'C247'
 'C248' 'C249' 'C250' 'C251' 'C252' 'C253' 'C254' 'C255' 'C256' 'C257'
 'C258' 'C259' 'C260' 'C261' 'C262' 'C263' 'C264' 'C265' 'C266' 'C267'
 'C268' 'C269' 'C270' 'C271' 'C272' 'C273' 'C274' 'C275' 'C276' 'C277'
 'C278' 'C279' 'C280' 'C281' 'C282' 'C283' 'C284' 'C285' 'C286' 'C287'
 'C288' 'C289' 'C290' 'C291' 'C292' 'C293' 'C294' 'C295' 'C296' 'C297'
 'C298' 'C299' 'C300' 'C301' 'C302' 'C303' 'C304' 'C305' 'C306' 'C307'
 'C308' 'C309' 'C310' 'C311' 'C312' 'C313' 'C314' 'C315' 'C316' 'C317'
 'C318' 'C319' 'C320' 'C321' 'C322' 'C323' 'C324' 'C325' 'C326' 'C327'
 'C328' 'C329' 'C330' 'C331' 'C332' 'C333' 'C334' 'C335' 'C336' 'C337'
 'C338' 'C339' 'C340' 'C341' 'C342' 'C343' 'C344' 'C345' 'C346' 'C347'
 'C348' 'C349' 'C350' 'C351' 'C352' 'C353' 'C354' 'C355' 'C356' 'C357'
 'C358' 'C359' 'C360' 'C361' 'C362' 'C363' 'C364' 'C365' 'C366' 'C367'
 'C368' 'C369' 'C370' 'C371' 'C372' 'C373' 'C374' 'C375' 'C376' 'C377'
 'C378' 'C379' 'C380' 'C381' 'C382' 'C383' 'C384' 'C385' 'C386' 'C387'
 'C388' 'C389' 'C390' 'C391' 'C392' 'C393' 'C394' 'C395' 'C396' 'C397'
 'C398' 'C399' 'C400' 'C401' 'C402' 'C403' 'C404' 'C405' 'C406' 'C407'
 'C408' 'C409' 'C410' 'C411' 'C412' 'C413' 'C414' 'C415' 'C416' 'C417'
 'C418' 'C419' 'C420' 'C421' 'C422' 'C423' 'C424' 'C425' 'C426' 'C427'
 'C428' 'C429' 'C430' 'C431' 'C432' 'C433' 'C434' 'C435' 'C436' 'C437'
 'C438' 'C439' 'C440' 'C441' 'C442' 'C443' 'C444' 'C445' 'C446' 'C447'
 'C448' 'C449' 'C450' 'C451' 'C452' 'C453' 'C454' 'C455' 'C456' 'C457'
 'C458' 'C459' 'C460' 'C461' 'C462' 'C463' 'C464' 'C465' 'C466' 'C467'
 'C468' 'C469' 'C470' 'C471' 'C472' 'C473' 'C474' 'C475' 'C476' 'C477'
 'C478' 'C479' 'C480' 'C481' 'C482' 'C483' 'C484' 'C485' 'C486' 'C487'
 'C488' 'C489' 'C490' 'C491' 'C492' 'C493' 'C494' 'C495' 'C496' 'C497'
 'C498' 'C499' 'C500' 'C501' 'C502' 'C503' 'C504' 'C505' 'C506' 'C507'
 'C508' 'C509' 'C510' 'C511' 'C512' 'C513' 'C514' 'C515' 'C516' 'C517'
 'C518' 'C519' 'C520' 'C521' 'C522' 'C523' 'C524' 'C525' 'C526' 'C527'
 'C528' 'C529' 'C530' 'C531']
Number of unique categories: 531

Column name: contact
Unique categories: ['telephone' 'cellular']
Number of unique categories: 2

Column name: month
Unique categories: ['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'mar' 'apr' 'sep']
Number of unique categories: 10

Column name: day_of_week
Unique categories: ['mon' 'tue' 'wed' 'thu' 'fri']
Number of unique categories: 5

Column name: poutcome
Unique categories: ['nonexistent' 'failure' 'success']
Number of unique categories: 3

Column name: postal_code
Unique categories: [42420 90036 33311 90032 28027 98103 76106 53711 84084 94109 68025 19140
 84057 90049 77095 75080 77041 60540 32935 55122 48185 19901 47150 10024
 12180 90004 60610 85234 22153 10009 49201 38109 77070 35601 94122 27707
 60623 29203 55901 55407 97206 55106 80013 28205 60462 10035 50322 43229
 37620 19805 61701 85023 95661 64055 91104 43055 53132 85254 95123 98105
 98115 73034 90045 19134 88220 78207 77036 62521 71203 6824 75051 92374
 45011 7090 19120 44312 80219 75220 37064 90604 48601 44256 43017 48227
 38401 33614 95051 55044 92037 77506 94513 27514 7960 45231 94110 90301
 33319 80906 7109 48180 8701 22204 80004 7601 33710 19143 90805 92345
 37130 84041 78745 1852 31907 6040 78550 85705 62301 2038 33024 98198
 61604 89115 2886 33180 28403 92646 40475 80027 1841 39212 48187 10801
 28052 32216 47201 13021 73071 94521 60068 79109 11757 90008 92024 77340
 14609 72701 92627 80134 30318 64118 59405 48234 33801 36116 85204 60653
 54302 45503 92804 98270 97301 78041 75217 43123 10011 48126 31088 94591
 92691 48307 7060 85635 98661 60505 76017 40214 75081 44105 75701 27217
 22980 19013 27511 32137 10550 48205 33012 11572 92105 60201 48183 55016
 71111 50315 93534 23223 28806 92530 68104 98026 92704 53209 41042 44052
 7036 93905 8901 17602 3301 21044 75043 6360 22304 43615 87401 92503 90503
 78664 92054 33433 23464 92563 28540 52601 98502 20016 65109 63376 61107
 33142 78521 10701 94601 28110 20735 30076 72401 47374 94509 33030 46350
 48911 44221 89502 22801 92025 48073 20852 33065 14215 33437 39503 93727
 27834 11561 35630 31204 52402 2908 81001 94533 32725 42071 6457 11520
 90660 84604 84062 30080 24153 44134 36608 2740 75061 8360 85301 14304
 27360 92683 38301 75019 91767 89031 18103 19711 85281 92677 8302 2149
 13601 54915 98006 75002 79907 76051 75007 37167 98031 70506 97224 60076
 75023 23434 46203 7002 28314 27405 21215 53142 66062 98002 74133 97756
 27604 74403 6450 42104 46614 6010 89015 99207 76248 45014 32127 97504
 22901 59801 33178 29501 97477 32712 19601 80020 65807 7501 73120 23320
 79424 65203 37604 36830 92404 1453 59715 85345 44107 8861 91761 91730
 56560 75150 95207 32174 94086 3820 17403 77840 63116 2169 95336 44240
 76903 84106 35810 37918 72209 48146 43302 80122 5408 4401 38671 47362
 48640 57103 80525 47905 37042 95823 91360 2148 1040 87105 89431 92236
 60126 7055 29406 23602 14701 46544 43402 92253 32303 37211 98226 60098
 76117 60090 29483 71901 80112 43130 88001 35244 75034 95687 84107 53186
 93309 33068 45373 78415 90278 32839 7050 70601 60035 11550 46060 55124
 29464 48310 54703 78577 59102 97030 37421 83642 92307 60440 55369 95695
 77489 77581 94403 49505 93277 66212 92592 92399 2151 77301 60477 52001
 48127 87505 28601 60188 56301 33161 46226 33317 34952 29730 79762 53214
 91911 66502 16602 80229 61821 47401 71854 78539 77520 46142 90712 2895
 54880 76021 98042 74012 33023 33021 77536 67212 78501 52240 83704 2920
 61032 77642 95610 75056 98052 32114 86442 46368 58103 46514 91776 33063
 30328 44060 73505 23666 13440 54601 83501 39401 94526 48858 84321 6708
 30605 4240 61832 85323 30062 85364 54401 99301 60302 32503 77573 20877
 84043 35401 92553 40324 80538 85224 59601 63122 76706 48066 60423 18018
 55113 68801 55125 48237 72756 88101 33458 93101 75104 68701 84020 48104
 91941 83201 49423 6460 60089 92630 96003 95928 13501 72032 82001 42301
 83605 70065 3060 38134 94061 37087 93454 60016 98632 37075 50701 2138
 60067 1915 97405 93030 98059 60025 33445 80022 77590 27893 87124 27534
 98208 90640 92020 77705 33407 79605 61761 63301 60174 93010 97123 91505
 95351 67846 8401 80501 95616 26003 95037 7011 53081 30344 57701 1810
 34741 6484 6810 52302 32771 78666 80634 76063 44035 83301 33134 60441
 1752 20707 77803 71603 57401 21740 7017 60004 60543 55433 92672 94568
 93405 72762 95240 77571 45040 30188]
Number of unique categories: 631

Column name: y
Unique categories: ['no' 'yes']
Number of unique categories: 2

Observations

  1. Columns of job, marital, education, default, housing, loan seem to have a category 'unknown' which needs further analysis
  2. region_code seems to have 'Na' in the category which needs to be explored. One way would be to map these Na values to their corresponding regions using the information of corresponding state_code
  3. poutcome has 'nonexistent' as one of the categories which need further analysis
  4. All other categorical columns seem to have correct number of categories in them
In [18]:
df['job'] = df['job'].replace('admin.','admin')

Filling any 'Na' in region_code with corresponding regions¶

In [19]:
def get_state_region(state):
    region = state_master_df.loc[state_master_df.State_Code == state]['Region_Code'].values[0]
    return region
In [20]:
df.region_code = df.state_code.map(lambda x:get_state_region(x))
In [21]:
df.region_code = df.region_code.astype('object')
In [22]:
df
Out[22]:
customer_id age job marital education default housing loan region_code state_code ... pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed postal_code y
0 1 56 services married high.school no no yes 3 S1 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 42420 no
1 2 45 services married basic.9y unknown no no 3 S1 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 42420 no
2 3 59 admin married professional.course no no no 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 90036 no
3 4 41 blue-collar married unknown unknown no no 3 S3 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 33311 no
4 5 24 technician single professional.course no yes no 3 S3 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 33311 no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
37079 37080 73 retired married professional.course no yes no 2 S16 ... -1 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 10009 yes
37080 37081 46 blue-collar married professional.course no no no 2 S16 ... -1 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 10011 no
37081 37082 56 retired married university.degree no yes no 2 S16 ... -1 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 10009 no
37082 37083 44 technician married professional.course no no no 4 S17 ... -1 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 85254 yes
37083 37084 74 retired married professional.course no yes no 1 S6 ... -1 1 failure -1.1 94.767 -50.8 1.028 4963.6 79109 no

37084 rows × 26 columns

Exploring unknown values¶

In [23]:
df.loc[df.job == 'unknown']
Out[23]:
customer_id age job marital education default housing loan region_code state_code ... pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed postal_code y
23 24 55 unknown married university.degree unknown unknown unknown 2 S10 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 19140 no
28 29 55 unknown married basic.4y unknown yes no 2 S10 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 19140 no
66 67 57 unknown married unknown unknown no no 1 S11 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 60610 no
82 83 57 unknown married unknown unknown yes no 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 94122 no
134 135 38 unknown divorced high.school unknown yes no 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 95661 no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
36353 36354 59 unknown married unknown no no no 2 S16 ... 6 1 success -1.7 94.027 -38.3 0.900 4991.6 10009 yes
36403 36404 64 unknown married unknown no yes no 1 S27 ... -1 0 nonexistent -1.7 94.027 -38.3 0.905 4991.6 74133 yes
36608 36609 67 unknown divorced unknown unknown yes no 2 S16 ... 6 2 success -1.1 94.199 -37.5 0.880 4963.6 10024 yes
36920 36921 63 unknown married professional.course no no no 4 S5 ... 6 1 success -1.1 94.601 -49.5 1.025 4963.6 98105 no
37008 37009 64 unknown married unknown no no no 2 S16 ... 3 3 success -1.1 94.767 -50.8 1.048 4963.6 10024 no

306 rows × 26 columns

In [24]:
306/37084 *100
Out[24]:
0.8251537051019309
In [25]:
df.loc[df.marital == 'unknown']
Out[25]:
customer_id age job marital education default housing loan region_code state_code ... pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed postal_code y
33 34 58 management unknown university.degree no yes no 2 S10 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 19140 no
346 347 59 retired unknown university.degree unknown no no 2 S32 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 1852 no
367 368 57 retired unknown basic.4y no no no 2 S30 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 6040 no
1334 1335 33 services unknown high.school no yes yes 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0 90049 no
1435 1436 32 unknown unknown university.degree no no no 2 S25 ... -1 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0 44134 yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34699 34700 31 entrepreneur unknown university.degree no no no 2 S40 ... -1 0 nonexistent -3.4 92.431 -26.9 0.722 5017.5 21215 yes
34726 34727 31 entrepreneur unknown university.degree no yes no 1 S12 ... -1 1 failure -3.4 92.431 -26.9 0.722 5017.5 55901 no
35393 35394 30 technician unknown university.degree no no no 4 S28 ... 11 2 success -1.8 93.369 -34.8 0.639 5008.7 88101 yes
35394 35395 30 technician unknown university.degree no no no 4 S28 ... 6 2 success -1.8 93.369 -34.8 0.639 5008.7 88101 yes
36791 36792 30 technician unknown university.degree no no no 2 S10 ... 16 1 success -1.1 94.601 -49.5 0.942 4963.6 19140 no

72 rows × 26 columns

In [26]:
72/37084*100
Out[26]:
0.19415381296516018
In [27]:
df.loc[df.education == 'unknown']
Out[27]:
customer_id age job marital education default housing loan region_code state_code ... pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed postal_code y
3 4 41 blue-collar married unknown unknown no no 3 S3 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 33311 no
6 7 41 blue-collar married unknown unknown no no 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 90032 no
21 22 59 technician married unknown no yes no 1 S9 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 68025 no
24 25 46 admin married unknown no no no 4 S8 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 84057 no
25 26 49 blue-collar married unknown no no no 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 90049 no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
37016 37017 34 technician married unknown no yes no 2 S16 ... -1 2 failure -1.1 94.767 -50.8 1.046 4963.6 10011 no
37018 37019 60 admin married unknown no no no 3 S3 ... -1 0 nonexistent -1.1 94.767 -50.8 1.046 4963.6 33801 no
37020 37021 34 technician married unknown no no no 3 S33 ... -1 0 nonexistent -1.1 94.767 -50.8 1.046 4963.6 30080 yes
37033 37034 54 technician married unknown no yes no 4 S2 ... -1 1 failure -1.1 94.767 -50.8 1.041 4963.6 92804 no
37073 37074 34 student single unknown no yes no 3 S1 ... -1 2 failure -1.1 94.767 -50.8 1.031 4963.6 42420 no

1549 rows × 26 columns

In [28]:
1549/37084*100
Out[28]:
4.177003559486571
In [29]:
df.loc[df.default == 'unknown']
Out[29]:
customer_id age job marital education default housing loan region_code state_code ... pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed postal_code y
1 2 45 services married basic.9y unknown no no 3 S1 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 42420 no
3 4 41 blue-collar married unknown unknown no no 3 S3 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 33311 no
6 7 41 blue-collar married unknown unknown no no 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 90032 no
11 12 54 retired married basic.9y unknown yes yes 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 90032 no
13 14 46 blue-collar married basic.6y unknown yes yes 4 S5 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 98103 no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
36817 36818 70 retired married basic.4y unknown no no 4 S2 ... 14 3 failure -1.1 94.601 -49.5 0.965 4963.6 91360 no
36865 36866 49 admin married high.school unknown no no 4 S5 ... 6 3 success -1.1 94.601 -49.5 0.985 4963.6 98105 yes
36866 36867 66 retired married basic.4y unknown unknown unknown 1 S12 ... 6 3 success -1.1 94.601 -49.5 0.985 4963.6 55407 yes
36891 36892 84 retired divorced basic.4y unknown yes no 2 S25 ... -1 0 nonexistent -1.1 94.601 -49.5 0.993 4963.6 43229 no
36904 36905 84 retired divorced basic.4y unknown yes yes 3 S20 ... 3 1 success -1.1 94.601 -49.5 1.000 4963.6 36830 no

7700 rows × 26 columns

In [30]:
7700/37084*100
Out[30]:
20.76367166432963
In [31]:
df.loc[df.housing == 'unknown']
Out[31]:
customer_id age job marital education default housing loan region_code state_code ... pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed postal_code y
23 24 55 unknown married university.degree unknown unknown unknown 2 S10 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 19140 no
73 74 51 blue-collar married basic.4y unknown unknown unknown 3 S19 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 38109 no
236 237 27 blue-collar single basic.6y no unknown unknown 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 95051 no
341 342 59 management married basic.4y unknown unknown unknown 2 S10 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 19140 no
356 357 48 services married high.school unknown unknown unknown 2 S16 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 10035 no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
36866 36867 66 retired married basic.4y unknown unknown unknown 1 S12 ... 6 3 success -1.1 94.601 -49.5 0.985 4963.6 55407 yes
36907 36908 31 admin single university.degree no unknown unknown 2 S16 ... 3 1 success -1.1 94.601 -49.5 1.008 4963.6 10035 yes
36929 36930 34 admin divorced university.degree no unknown unknown 1 S11 ... -1 1 failure -1.1 94.601 -49.5 1.025 4963.6 60505 no
36941 36942 46 management married university.degree no unknown unknown 4 S2 ... 12 2 failure -1.1 94.601 -49.5 1.032 4963.6 94110 no
37013 37014 42 services divorced university.degree no unknown unknown 2 S16 ... -1 1 failure -1.1 94.767 -50.8 1.046 4963.6 10011 yes

882 rows × 26 columns

In [32]:
882/37084*100
Out[32]:
2.378384208823212
In [33]:
df.loc[df.loan == 'unknown']
Out[33]:
customer_id age job marital education default housing loan region_code state_code ... pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed postal_code y
23 24 55 unknown married university.degree unknown unknown unknown 2 S10 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 19140 no
73 74 51 blue-collar married basic.4y unknown unknown unknown 3 S19 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 38109 no
236 237 27 blue-collar single basic.6y no unknown unknown 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 95051 no
341 342 59 management married basic.4y unknown unknown unknown 2 S10 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 19140 no
356 357 48 services married high.school unknown unknown unknown 2 S16 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 10035 no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
36866 36867 66 retired married basic.4y unknown unknown unknown 1 S12 ... 6 3 success -1.1 94.601 -49.5 0.985 4963.6 55407 yes
36907 36908 31 admin single university.degree no unknown unknown 2 S16 ... 3 1 success -1.1 94.601 -49.5 1.008 4963.6 10035 yes
36929 36930 34 admin divorced university.degree no unknown unknown 1 S11 ... -1 1 failure -1.1 94.601 -49.5 1.025 4963.6 60505 no
36941 36942 46 management married university.degree no unknown unknown 4 S2 ... 12 2 failure -1.1 94.601 -49.5 1.032 4963.6 94110 no
37013 37014 42 services divorced university.degree no unknown unknown 2 S16 ... -1 1 failure -1.1 94.767 -50.8 1.046 4963.6 10011 yes

882 rows × 26 columns

In [34]:
882/37084*100
Out[34]:
2.378384208823212
In [35]:
df.poutcome.value_counts()
Out[35]:
nonexistent    32023
failure         3822
success         1239
Name: poutcome, dtype: int64

Action Insights

  1. Since the percentage of unknown values in the columns of job, marital, education, housing and loan are very less (<5%) of the total number of data points, we can successfully impute these unknowns with the mode of these columns.
  2. Default column seems to have approximately 20% of the data points as unknowns. So, rather than imputing them by mode, we can consider them as a separate category at the time of model building.
  3. poutcome column also seems to have a nonexistent category which is the largest in number. This is actually possible as the persons who weren't contacted in the previous campaign will not have any success or failure outcome corresponsing to them. However, we need to check that the value of previous column corresponding to these nonexistent values are 0 or not.

Checking for the previous column values corresponding to poutcome = nonexistent¶

In [36]:
df.loc[df.poutcome=='nonexistent']['previous'].sum()
Out[36]:
0

Missing Value Treatment¶

Imputing Unknowns with Mode in columns except for Default¶

In [37]:
df.job[df.job=='unknown'] = np.nan
df.marital[df.marital=='unknown'] = np.nan
df.education[df.education=='unknown'] = np.nan
df.housing[df.housing=='unknown'] = np.nan
df.loan[df.loan=='unknown'] = np.nan
In [38]:
df.job.fillna(df.job.mode()[0],inplace=True)
df.marital.fillna(df.marital.mode()[0],inplace=True)
df.education.fillna(df.education.mode()[0],inplace=True)
df.housing.fillna(df.housing.mode()[0],inplace=True)
df.loan.fillna(df.loan.mode()[0],inplace=True)
In [39]:
df.shape
Out[39]:
(37084, 26)

Analysis of Percentage Turnout of Marketing Campaign¶

In [40]:
percent_turnout = df['y'].value_counts(normalize=True)*100
percent_turnout = pd.DataFrame(percent_turnout)
percent_turnout.columns = ['response']
percent_turnout
Out[40]:
response
no 88.652788
yes 11.347212
In [41]:
percent_turnout.plot(kind='bar')

Right Mode to contact the customers (telephone or mobile)¶

In [42]:
contact_percent_df = pd.crosstab(df.loc[df.y=='yes']['y'],df.contact,normalize=True)*100
contact_percent_df.index.name = 'response'
contact_percent_df
Out[42]:
contact cellular telephone
response
yes 82.984791 17.015209
In [43]:
contact_percent_df.plot(kind='bar')

Attempts made to turn a person to a successful depositor¶

In [44]:
attempts_made_df = pd.crosstab(df.loc[df.y=='yes']['y'],df.campaign,normalize=True)*100
attempts_made_df.index.name = 'response'
attempts_made_df
Out[44]:
campaign 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17 23
response
yes 49.6673 25.998099 12.238593 5.584601 2.614068 1.639734 0.784221 0.3327 0.308935 0.237643 0.261407 0.071293 0.071293 0.023764 0.047529 0.095057 0.023764
In [45]:
attempts_made_df.plot(kind='bar')
In [46]:
attempts_made_df_prev = pd.crosstab(df.loc[df.y=='yes']['y'],df.previous,normalize=True)*100
attempts_made_df_prev.index.name = 'response'
attempts_made_df_prev
Out[46]:
previous 0 1 2 3 4 5 6
response
yes 67.609316 20.912548 7.604563 2.685361 0.831749 0.285171 0.071293
In [47]:
attempts_made_df_prev.plot(kind='bar')

Observations

  1. It seems that the percentage turnout of the marketing campaign is roughly 11%, which indicates that the campaign is not that successful in converting majority of the target customers towards term deposits.
  2. Amongst the customers which actually agreed to place a deposit, roughly 80% of them were contacted by mobile. Hence mobile/cellular is a more preferred mode of contacting the customers. This might also be because of the fact that telephone usage have been diminished with the advancements in technologies and hence most customers might not posses those.
  3. Amongst the customers which actually agreed to place a deposit, almost 88% of them were contacted 3 times or less for the current campaign scheme. Hence, majority of the customers who were convinced for the scheme required 3 or less attempts.
  4. Amongst the customers which were convinced during the previous campaign, 95% of them were contacted 2 times or less. Hence, increase in the total attempts beyond 3 will not necessarily drive more customers towards the scheme.
  5. Detailed analysis of all the hypothesis will be done in Univariate and Bivariate Analysis through various plots and statistical tests.

Step 2: Exploratory Data Analysis and Hypothesis Testing¶

Univariate Analysis¶

1. Numerical Features¶

In [48]:
df.dtypes
Out[48]:
customer_id         int64
age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
region_code        object
state_code         object
city_code          object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp_var_rate      float64
cons_price_idx    float64
cons_conf_idx     float64
euribor3m         float64
nr_employed       float64
postal_code        object
y                  object
dtype: object
In [49]:
numerical_df = df.select_dtypes(['int64','float64'])
In [50]:
numerical_df.dtypes
Out[50]:
customer_id         int64
age                 int64
duration            int64
campaign            int64
pdays               int64
previous            int64
emp_var_rate      float64
cons_price_idx    float64
cons_conf_idx     float64
euribor3m         float64
nr_employed       float64
dtype: object
In [51]:
numerical_df.describe()
Out[51]:
customer_id age duration campaign pdays previous emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed
count 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000
mean 18542.500000 40.042714 258.237946 2.569545 -0.741721 0.172986 0.082669 93.576076 -40.505183 3.621668 5167.058664
std 10705.373028 10.432965 258.730909 2.770611 1.515288 0.495681 1.568997 0.578493 4.622045 1.733972 72.196605
min 1.000000 17.000000 0.000000 1.000000 -1.000000 0.000000 -3.400000 92.201000 -50.800000 0.634000 4963.600000
25% 9271.750000 32.000000 102.000000 1.000000 -1.000000 0.000000 -1.800000 93.075000 -42.700000 1.344000 5099.100000
50% 18542.500000 38.000000 180.000000 2.000000 -1.000000 0.000000 1.100000 93.749000 -41.800000 4.857000 5191.000000
75% 27813.250000 47.000000 319.250000 3.000000 -1.000000 0.000000 1.400000 93.994000 -36.400000 4.961000 5228.100000
max 37084.000000 98.000000 4918.000000 56.000000 27.000000 7.000000 1.400000 94.767000 -26.900000 5.045000 5228.100000
In [52]:
group1 = ['age','duration','pdays']
group2 = ['campaign','previous']
group3 = ['euribor3m','nr_employed','emp_var_rate']
group4 = ['cons_price_idx','cons_conf_idx']
In [53]:
def UVA_KDE(data, var_group):
    size = len(var_group)
    plt.figure(figsize = (7*size,3), dpi = 100)
    for i,var in enumerate(var_group):
        mini = data[var].min()
        maxi = data[var].max()
        ran = data[var].max()-data[var].min()
        mean = data[var].mean()
        median = data[var].median()
        st_dev = data[var].std()
        skew = data[var].skew()
        kurt = data[var].kurtosis()
        points = mean-st_dev, mean+st_dev
        plt.subplot(1,size,i+1)
        sns.kdeplot(data[var], shade=True)
        sns.lineplot(points, [0,0], color = 'black', label = "std_dev")
        sns.scatterplot([mini,maxi], [0,0], color = 'orange', label = "min/max")
        sns.scatterplot([mean], [0], color = 'red', label = "mean")
        sns.scatterplot([median], [0], color = 'blue', label = "median")
        plt.xlabel('{}'.format(var), fontsize = 20)
        plt.ylabel('density')
        plt.title('kurtosis = {};\nskew = {}; range = {}\nmean = {}; median = {}'.format(round(kurt,2),round(skew,2),
                                                                                         round(ran,2),
                                                                                         round(mean,2),
                                                                                         round(median,2)))
In [54]:
UVA_KDE(numerical_df,group1)
No description has been provided for this image
In [55]:
UVA_KDE(numerical_df,group2)
No description has been provided for this image
In [56]:
UVA_KDE(numerical_df,group3)
No description has been provided for this image
In [57]:
UVA_KDE(numerical_df,group4)
No description has been provided for this image

Observations

  • GROUP1
    1. age varaiable is a continuous numerical variable with a mean age of 40. But its distribution is slighly right skewed, indicative of presence of some large values(>65) which can be outliers and need to be analysed further using boxplots and treated accordingly.
    2. duration variable is also a continuous numerical variable with a mean duration of 258. But its distribution is also right skewed indicative of the presence of some large outliers which need to be analysed further ausing boxplots and treated.
    3. pdays is a discrete numeric column and the kde plot indicates that most of the customers are contaced for the first time during the current campaign.
  • GROUP2
    1. campaign is a discrete numeric column with a median of 2 which indicates that 50% of the customers were contacted 2 times or less during the current campaign. However, this column still has some outlier values which can be analysed using boxplots and treated accordingly.
    2. previous column is also discrete numerical and the distribution indicates that most of the customers were not contacted at all previously.
  • GROUP3
    1. euriobr3m which is a daily indicator seems to have a bimodal distribution with majority of the customers having a value of 1 and 5.
    2. nr_employed which is a quarterly indicator has multimodal distribution with majority of customers having a score of 5100 and 5200 (higher values) indicative of high proportion of workforce involved in jobs and services.
    3. emp_var_rate seems to have a multimodal distribution with majority of customers having a value of 1.5 or -2 indicative of the fact that we have high proportion of customers from both currently employed or actively seeking employment and those who are unemployed or retired.
  • GROUP4
    1. consumer price index has a multimodal distribution with majority of the index numbers towards the higher end of the spectrum indicating of surge in the prices for availing bank services over time.
    2. consumer confidence index also has a multimodal distribution with the majority index numbers are large negatives indicating that the customers are generally pessimistic of their financial situation. This might be one of the reasons why these customers show interests in fixed deposits schemes as a safe mode of investment.
In [58]:
import stats
In [59]:
def UVA_boxplot(data, var_group):
    size = len(var_group)
    plt.figure(figsize = (7*size,4), dpi = 100)
    from scipy import stats
    for i,var in enumerate(var_group):
   
        quant25 = data[var].quantile(0.25)
        quant75 = data[var].quantile(0.75)
        IQR = quant75 - quant25
        med = data[var].median()
        whis_low = quant25-(1.5*IQR)
        whis_high = quant75+(1.5*IQR)
        skew = data[var].skew()
        mini = data[var].min()
        maxi = data[var].max()
        kurt = data[var].kurtosis()
        shapiro_test = stats.shapiro(data[var])

        outlier_high = len(data[var][data[var]>whis_high])
        outlier_low = len(data[var][data[var]<whis_low])

        plt.subplot(1,size,i+1)
        sns.boxplot(data[var], orient="v")
        plt.ylabel('{}'.format(var))
        plt.title('Skew = {}; Range = {} \n 1st, 2nd, 3rd  quartile = {};\n Whiskers = {}; kurtosis = {} ; \n Number of Outlier (low/high) = {} '.format(
                                                                                                       round(skew,2),
                                                                                                       (round(mini,2),round(maxi,2)),
                                                                                                       (round(quant25,2),round(med,2),round(quant75,2)),
                                                                                                       (round(whis_low,2), round(whis_high,2)), 
                                                                                                       (round(kurt,2)),
                                                                                                       (outlier_low,outlier_high)
                                                                                                       ))
In [60]:
UVA_boxplot(numerical_df,group1)
No description has been provided for this image
In [61]:
UVA_boxplot(numerical_df,group2)
No description has been provided for this image
In [62]:
UVA_boxplot(numerical_df,group3)
No description has been provided for this image
In [63]:
UVA_boxplot(numerical_df,group4)
No description has been provided for this image

Observations

  1. age column seems to have some outliers i.e. values > 70 which is a bit odd. So these values can be clipped and made to 70.
  2. duration and campaign columns also have certain outliers. However, these values should be checked for correlation. If the correlation is high, then these columns do not need any outlier treatment. However, if the correlation values are very low, then these values will be clipped to their highest whisker values.
  3. pdays and previous columns have outliers. But these don't need any treatment as these indicate only the number of times customers were contacted previously.
  4. cons_conf_idx also has some small number of outliers. But the difference between the highest whisker value and the outlier values is very small. Hence, they will be left unchanged.

2. Categorical Features

In [64]:
categorical_df = df.select_dtypes(['object'])
categorical_df.dtypes
Out[64]:
job            object
marital        object
education      object
default        object
housing        object
loan           object
region_code    object
state_code     object
city_code      object
contact        object
month          object
day_of_week    object
poutcome       object
postal_code    object
y              object
dtype: object
In [65]:
categorical_df.describe()
Out[65]:
job marital education default housing loan region_code state_code city_code contact month day_of_week poutcome postal_code y
count 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084 37084
unique 11 3 7 3 2 2 4 49 531 2 10 5 3 631 2
top admin married university.degree no yes no 4 S2 C21 cellular may thu nonexistent 10035 no
freq 9726 22551 12520 29382 20315 31443 11914 7427 3422 23522 12420 7778 32023 974 32876
In [66]:
def UVA_category(data, var_group):
    size = len(var_group)
    plt.figure(figsize = (7*size,7), dpi = 100)
    for i,var in enumerate(var_group):
        norm_count = round(data[var].value_counts(normalize = True)*100,2)
        n_uni = data[var].nunique()
        sns.set_style('darkgrid')
        plt.subplot(1,size,i+1)
        sns.barplot(norm_count, norm_count.index , order = norm_count.index)
        plt.xlabel('fraction/percent', fontsize = 20)
        plt.ylabel('{}'.format(var), fontsize = 20)
        plt.title('n_uniques = {} \n value counts \n {};'.format(n_uni,norm_count))
In [67]:
group1 = ['job','marital','education']
group2 = ['default','housing','loan']
group3 = ['contact','month','day_of_week']
group4 = ['poutcome','y']
In [68]:
UVA_category(categorical_df,group1)
No description has been provided for this image
In [69]:
UVA_category(categorical_df,group2)
No description has been provided for this image
In [70]:
UVA_category(categorical_df,group3)
No description has been provided for this image
In [71]:
UVA_category(categorical_df,group4)
No description has been provided for this image

Observations¶

  1. Almost 75% of the customers are involved in either administrative, blue-collar, services or technician job.
  2. Almost 88% of the customers are married or single.
  3. Almost 82% of the customers have a university degree, high school degree, basic_9y education or professional course completion. Hence majority of the customers are well educated.
  4. Almost 80% of the customers have not defaulted in the credits in past which is a good sign for the bank.
  5. Housing loans are taken by approximately 50% of the customers.
  6. 80% of the customers don't have any personal loans.
  7. Almost 65% of the customers have cellular as their category of contact for the current campaign.
  8. Almost 80% of the campaigns seems to be running in the months of may, june, july and august for the bank.
  9. Day of the week seem to be equally distributed across all the working days indicating that the customers are being reached on all working days equally.
  10. As 85% of the customers haven't been contacted previously, they don't have any details corresponding to the outcome of the previous campaign. This might be because these customers are newly engaged with the bank so, the bank must focus on exciting schemes in the future too in order to retain them.
  11. The campaign doesn't seem to be successful as majority of the customers haven't invested in the term deposit policies. The bank can either increase their interiest rates for this policy or also try to approach these customers through other media channels like digital platforms, personalized advertisements, sms, emails etc. in future.

Bivariate Analysis¶

1. Numerical-Numerical¶

In [72]:
numerical_df.dtypes
Out[72]:
customer_id         int64
age                 int64
duration            int64
campaign            int64
pdays               int64
previous            int64
emp_var_rate      float64
cons_price_idx    float64
cons_conf_idx     float64
euribor3m         float64
nr_employed       float64
dtype: object
In [73]:
numerical_df = numerical_df.drop('customer_id',axis=1)
In [74]:
rounded_corr = round(numerical_df.corr(method='pearson'),2)
plt.figure(figsize=(8,8))
ax = sns.heatmap(rounded_corr,cmap = 'BuPu', annot=True)
plt.show()
No description has been provided for this image

Observations

  1. pdays and previous seem to have some positive correlation between them
  2. previous and emp_var_rate seem to have a some medium negative correlation
  3. cons_price_idx, euribor3m and nr_employed seem to have very strong positive correlation with emp_var_rate
  4. eurobor3m, nr_employed seem to have some positive correlation with cons_price_idx
  5. previous has some negative correlation with euribor3m and nr_employed seem to have a high positive correlation with euribor3m
  6. previous has some negative correlation with nr_employed
  7. campaign and duration don't seem to have any significant correlation between them. This is in opposite to our supposition in univariate analysis that high campaign values would be related to high duration. So the outliers in these columns will be clipped to their respective highest whisker values.
In [75]:
sns.scatterplot(numerical_df['pdays'],numerical_df['previous'])
Out[75]:
<AxesSubplot:xlabel='pdays', ylabel='previous'>
No description has been provided for this image
In [76]:
sns.scatterplot(numerical_df['previous'],numerical_df['emp_var_rate'])
Out[76]:
<AxesSubplot:xlabel='previous', ylabel='emp_var_rate'>
No description has been provided for this image
In [77]:
sns.scatterplot(numerical_df['cons_price_idx'],numerical_df['emp_var_rate'])
Out[77]:
<AxesSubplot:xlabel='cons_price_idx', ylabel='emp_var_rate'>
No description has been provided for this image
In [78]:
sns.scatterplot(numerical_df['euribor3m'],numerical_df['emp_var_rate'])
Out[78]:
<AxesSubplot:xlabel='euribor3m', ylabel='emp_var_rate'>
No description has been provided for this image
In [79]:
sns.scatterplot(numerical_df['nr_employed'],numerical_df['emp_var_rate'])
Out[79]:
<AxesSubplot:xlabel='nr_employed', ylabel='emp_var_rate'>
No description has been provided for this image
In [80]:
sns.scatterplot(numerical_df['cons_price_idx'],numerical_df['euribor3m'])
Out[80]:
<AxesSubplot:xlabel='cons_price_idx', ylabel='euribor3m'>
No description has been provided for this image
In [81]:
sns.scatterplot(numerical_df['cons_price_idx'],numerical_df['nr_employed'])
Out[81]:
<AxesSubplot:xlabel='cons_price_idx', ylabel='nr_employed'>
No description has been provided for this image
In [82]:
sns.scatterplot(numerical_df['previous'],numerical_df['euribor3m'])
Out[82]:
<AxesSubplot:xlabel='previous', ylabel='euribor3m'>
No description has been provided for this image
In [83]:
sns.scatterplot(numerical_df['previous'],numerical_df['nr_employed'])
Out[83]:
<AxesSubplot:xlabel='previous', ylabel='nr_employed'>
No description has been provided for this image
In [84]:
sns.scatterplot(numerical_df['euribor3m'],numerical_df['nr_employed'])
Out[84]:
<AxesSubplot:xlabel='euribor3m', ylabel='nr_employed'>
No description has been provided for this image

Observations¶

  1. Although the correlation heatmap gave us several high values for correlations, the scatter plot doesn't seem to indicate the same as the values of most of the numeric features are discrete. Hence, we won't be deleting any of the highly correlated features for our modelling purpose.

2. Categorical-Categorical¶

LIST OF HYPOTHESIS TO CHECK¶

Chi-square results

  • Do job level have a significant impact on the term deposit subscription?
  • Do marital status have a significant impact on the term deposit subscription?
  • Do education level have a significant impact on the term deposit subscription?
  • Do default have a significant impact on the term deposit subscription?
  • Do housing loans have a significant impact on the term deposit subscription?
  • Do personal loans have a significant impact on the term deposit subscription?
  • Do region_code have a significant impact on the term deposit subscription?
  • Do contact type have a significant impact on the term deposit subscription?
  • Do month of the year have a significant impact on the term deposit subscription?
  • Do day_of_week have a significant impact on the term deposit subscription?
  • Do previous campaign outcome have a significant impact on the term deposit subscription?
In [85]:
pd.options.plotting.backend = 'matplotlib'
In [86]:
from scipy.stats import chi2_contingency
In [87]:
def BVA_categorical_plot(data, tar, cat):
    data = data[[cat,tar]][:]

    table = pd.crosstab(data[tar],data[cat],)
    f_obs = np.array([table.iloc[0][:].values,
                    table.iloc[1][:].values])

    chi, p, dof, expected = chi2_contingency(f_obs)
    if p<0.05:
        sig = True
    else:
        sig = False
    plt.figure(figsize=(12,6))
    sns.countplot(x=cat, hue=tar, data=data)
    plt.title("p-value = {}\n difference significant? = {}\n Chi-square = {}".format(round(p,5),sig,round(chi,2)))
    dd = data.groupby(cat)[tar].value_counts(normalize=True).unstack()*100
    dd.plot(kind='bar', stacked='True',title=str(dd),figsize=(12,6))
In [88]:
BVA_categorical_plot(categorical_df,'y','job')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 0 which is <0.05. This indicates that the our null hypothesis(job type does not have any impact on the term deposit subscription) is rejected and hence different job types have different term deposit subscription
  2. Students and retired people have the largest proportion of subscribers to our policy
In [89]:
BVA_categorical_plot(categorical_df,'y','marital')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 0 which is <0.05. This indicates that the our null hypothesis(marital status does not have any impact on the term deposit subscription) is rejected and hence whether the person is married, single or divorced have different term deposit subscription
  2. Single people seem to have more term deposits than the other marital statuses
In [90]:
BVA_categorical_plot(categorical_df,'y','education')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 0 which is <0.05. This indicates that the our null hypothesis(education level does not have any impact on the term deposit subscription) is rejected and hence different education levels have different amount of term deposit subscribers
  2. People having university degree seem to subscribe our policy the most
In [91]:
BVA_categorical_plot(categorical_df,'y','default')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 0 which is <0.05. This indicates that the our null hypothesis(default does not have any impact on the term deposit subscription) is rejected and hence whether the person has defaulted in the past or not affects the term deposit subscription
  2. Those who do not default seem to have the most term deposits
In [92]:
BVA_categorical_plot(categorical_df,'y','housing')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 0.03 which is <0.05. This indicates that the our null hypothesis(people having housing loan does not have any impact on the term deposit subscription) is rejected and hence whether the person has a housing loan or not affects the term deposit subscription
  2. Those who have housing loans seem to have the most term deposits
In [93]:
BVA_categorical_plot(categorical_df,'y','loan')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 0.32 which is >0.05. This indicates that we fail to reject our null hypothesis(people having housing loan does not have any impact on the term deposit subscription). Hence whether the person has a personal loan or not does not affect the term deposit subscription
In [94]:
BVA_categorical_plot(categorical_df,'y','region_code')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 0.21 which is >0.05. This indicates that we fail to reject our null hypothesis(people living in different regions does not have any impact on the term deposit subscription). Hence whether the person lives in region 1,2,3 or 4 does not affect the term deposit subscription
In [95]:
BVA_categorical_plot(categorical_df,'y','contact')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 0 which is <0.05. This indicates that we reject our null hypothesis(people contacted via telephone or mobile does not have any impact on the term deposit subscription)
  2. People contacted via mobile phones tend to have a higher subscription rate
In [96]:
BVA_categorical_plot(categorical_df,'y','month')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 0 which is <0.05. This indicates that we reject our null hypothesis(month of campaign does not have any impact on the term deposit subscription) and hence different months have different subscription rates
  2. The months of march, sept, oct and dec tend to have a higher subscription rate
In [97]:
BVA_categorical_plot(categorical_df,'y','day_of_week')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 2e-5 which is <0.05. This indicates that we reject our null hypothesis(day of the week does not have any impact on the term deposit subscription) and hence different days have different subscription rates
  2. Thursdays tend to have the highest subscription rate
In [98]:
BVA_categorical_plot(categorical_df,'y','poutcome')
No description has been provided for this image
No description has been provided for this image

Observations

  1. As the p-values is 0 which is <0.05. This indicates that we reject our null hypothesis(outcome of previous campaign does not have any impact on the term deposit subscription) and hence previous campaigns do affect our subscription rates
  2. People who subscribed in the previous campaign tend to have the highest subscription rate in the current campaign as well

3. Categorical-Numerical¶

LIST OF HYPOTHESIS TO CHECK¶

  • Are the average age of subscribed customer equal to that of unsubscribed customer?
  • Are the average duration of contact of subscribed customer equal to that of unsubscribed customer?
  • Are the average current campaign contacts of subscribed customer equal to that of unsubscribed customer?
  • Are the average number of days of contact after previous campaign of subscribed customer equal to that of attrited customer?
  • Are the average previous campiagn contacts of subscribed customer equal to that of unsubscribed customer?
  • Are the average emp_var_rate of subscribed customer equal to that of unsubscribed customer?
  • Are the average cons_price_idx of subscribed customer equal to that of unsubscribed customer?
  • Are the average cons_conf_idx of subscribed customer equal to that of unsubscribed customer?
  • Are the average euribor3m of subscribed customer equal to that of unsubscribed customer?
  • Are the average nr_employed of subscribed customer equal to that of unsubscribed customer?
In [99]:
from numpy import sqrt, abs, round
from scipy.stats import norm
from scipy.stats import t as t_dist
In [100]:
def TwoSampZ(X1, X2, sigma1, sigma2, N1, N2):
    ovr_sigma = sqrt(sigma1**2/N1 + sigma2**2/N2)
    z = (X1 - X2)/ovr_sigma
    pval = 2*(1 - norm.cdf(abs(z)))
    return pval
'''''
Testing if they are equal or not. If p_value < 0.05 we reject the null hypothesis
'''''

def TwoSampT(X1, X2, sd1, sd2, n1, n2):
    ovr_sd = sqrt(sd1**2/n1 + sd2**2/n2)
    t = (X1 - X2)/ovr_sd
    df = n1+n2-2
    pval = 2*(1 - t_dist.cdf(abs(t),df))
    return pval
'''''
Testing if they are equal or not. If p_value < 0.05 we reject the null hypothesis
'''''

def Bivariate_cont_cat(data, cont, cat, category):
    x1 = data[cont][data[cat]==category][:]
    x2 = data[cont][~(data[cat]==category)][:]
    n1, n2 = x1.shape[0], x2.shape[0]
    m1, m2 = x1.mean(), x2.mean()
    std1, std2 = x1.std(), x2.mean()
    t_p_val = round(TwoSampT(m1, m2, std1, std2, n1, n2),4)
    z_p_val = round(TwoSampZ(m1, m2, std1, std2, n1, n2),4)
    table = pd.pivot_table(data=data, values=cont, columns=cat, aggfunc = np.mean)
    table = round(table,2)
    plt.figure(figsize = (15,6), dpi=140)
    plt.subplot(1,2,1)
    sns.barplot([str(category),'not {}'.format(category)], [m1, m2])
    plt.ylabel('mean {}'.format(cont))
    plt.xlabel(cat)
    plt.title('t-test p-value = {} \n z-test p-value = {}\n {}'.format(t_p_val, z_p_val, table))
    plt.subplot(1,2,2)
    sns.boxplot(x=cat, y=cont, data=data)
    plt.title('categorical boxplot')
In [101]:
Bivariate_cont_cat(df, 'age', 'y', 'yes')
No description has been provided for this image

Observations

  1. Since the p-value for both t-test and z-test are 0.0007<0.05. Hence we reject the null hypothesis. So, the mean age of customers who subscribed and do not subscribe are statistically different
In [102]:
Bivariate_cont_cat(df, 'duration', 'y', 'yes')
No description has been provided for this image

Observations

  1. Since the p-value for both t-test and z-test are 0<0.05. Hence we reject the null hypothesis. So, the mean duration of contact for customers who subscribed and do not subscribe are statistically different
In [103]:
Bivariate_cont_cat(df, 'campaign', 'y', 'yes')
No description has been provided for this image

Observations

  1. Since the p-value for both t-test and z-test are 0<0.05. Hence we reject the null hypothesis. So, the mean number of times of contact in current campaign for customers who subscribed and do not subscribe are statistically different
In [104]:
Bivariate_cont_cat(df, 'pdays', 'y', 'yes')
No description has been provided for this image

Observations

  1. Since the p-value for both t-test and z-test are 0<0.05. Hence we reject the null hypothesis. So, the average amount of days after which the customers are called from the previous campiagn for customers who subscribed and do not subscribe are statistically different
In [105]:
Bivariate_cont_cat(df, 'previous', 'y', 'yes')
No description has been provided for this image

Observations

  1. Since the p-value for both t-test and z-test are 0<0.05. Hence we reject the null hypothesis. So, the mean number of times of contact in previous campaign for customers who subscribed and do not subscribe are statistically different
In [106]:
Bivariate_cont_cat(df, 'emp_var_rate', 'y', 'yes')
No description has been provided for this image

Observations

  1. Since the p-value for both t-test and z-test are 0<0.05. Hence we reject the null hypothesis. So, the average emp_var_rates for customers who subscribed and do not subscribe are statistically different
In [107]:
Bivariate_cont_cat(df, 'cons_price_idx', 'y', 'yes')
No description has been provided for this image

Observations

  1. Since the p-value for both t-test and z-test are 0.6303>0.05. Hence we fail to reject the null hypothesis. So, the average cons_price_idx for customers who subscribed and do not subscribe are statistically similar
In [108]:
Bivariate_cont_cat(df, 'cons_conf_idx', 'y', 'yes')
No description has been provided for this image

Observations

  1. Since the p-value for both t-test and z-test are 0.0011<0.05. Hence we reject the null hypothesis. So, the average cons_conf_idx for customers who subscribed and do not subscribe are statistically different
In [109]:
Bivariate_cont_cat(df, 'euribor3m', 'y', 'yes')
No description has been provided for this image

Observations

  1. Since the p-value for both t-test and z-test are 0<0.05. Hence we reject the null hypothesis. So, the average euribor3m index for customers who subscribed and do not subscribe are statistically different
In [110]:
Bivariate_cont_cat(df, 'nr_employed', 'y', 'yes')
No description has been provided for this image

Observations

  1. Since the p-value for both t-test and z-test are 0.0045<0.05. Hence we reject the null hypothesis. So, the average nr_employed index for customers who subscribed and do not subscribe are statistically different

Step3: Outlier Treatment and Data Transformation¶

In [111]:
df
Out[111]:
customer_id age job marital education default housing loan region_code state_code ... pdays previous poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m nr_employed postal_code y
0 1 56 services married high.school no no yes 3 S1 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 42420 no
1 2 45 services married basic.9y unknown no no 3 S1 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 42420 no
2 3 59 admin married professional.course no no no 4 S2 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 90036 no
3 4 41 blue-collar married university.degree unknown no no 3 S3 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 33311 no
4 5 24 technician single professional.course no yes no 3 S3 ... -1 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 33311 no
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
37079 37080 73 retired married professional.course no yes no 2 S16 ... -1 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 10009 yes
37080 37081 46 blue-collar married professional.course no no no 2 S16 ... -1 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 10011 no
37081 37082 56 retired married university.degree no yes no 2 S16 ... -1 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 10009 no
37082 37083 44 technician married professional.course no no no 4 S17 ... -1 0 nonexistent -1.1 94.767 -50.8 1.028 4963.6 85254 yes
37083 37084 74 retired married professional.course no yes no 1 S6 ... -1 1 failure -1.1 94.767 -50.8 1.028 4963.6 79109 no

37084 rows × 26 columns

Outlier Treatment for Age, Duration, Campaign Columns (As per our insights from EDA)

In [112]:
# Defing remove_outlier function which returns lower_range and Upper_range of IQR
def remove_outlier(col):
    sorted(col)
    Q1,Q3=np.percentile(col,[25,75])
    IQR=Q3-Q1
    lower_range= Q1-(1.5 * IQR)
    upper_range= Q3+(1.5 * IQR)
    return lower_range, upper_range

lr,ur=remove_outlier(df['duration'])
#df = df[df['duration']>=lr]
#df = df[df['duration']<=ur]
df['duration'][df['duration']<lr] = lr
df['duration'][df['duration']>ur] = ur

lr,ur=remove_outlier(df['campaign'])
df['campaign'][df['campaign']<lr] = lr
df['campaign'][df['campaign']>ur] = ur

lr,ur=remove_outlier(df['age'])
df['age'][df['age']<lr] = lr
df['age'][df['age']>ur] = ur

Dropping Postal Code to prevent Dimensionality Explosion and Overfitting

In [113]:
df.drop('postal_code', axis=1, inplace=True)
In [114]:
df.shape
Out[114]:
(37084, 25)

One Hot Encoding of Object Columns (as they are nominal in nature)

In [115]:
df =pd.get_dummies(df, columns=['job',
'marital',
'education',
'default',
'housing',
'loan',
'region_code',
'state_code',
'contact',
'month',
'day_of_week',
'city_code',
'poutcome',
'y'],drop_first=True)
In [116]:
df.shape
Out[116]:
(37084, 631)

Data Transformation/Normalization of Numeric Columns using z-score

In [117]:
from scipy.stats import zscore
In [121]:
df[['age','duration','campaign','pdays','previous','emp_var_rate','cons_price_idx', 'cons_conf_idx','nr_employed','euribor3m']] = df[['age','duration','campaign','pdays','previous','emp_var_rate','cons_price_idx', 'cons_conf_idx','nr_employed','euribor3m']].apply(zscore)
In [122]:
df.describe()
Out[122]:
customer_id age duration campaign pdays previous emp_var_rate cons_price_idx cons_conf_idx euribor3m ... city_code_C93 city_code_C94 city_code_C95 city_code_C96 city_code_C97 city_code_C98 city_code_C99 poutcome_nonexistent poutcome_success y_yes
count 37084.000000 3.708400e+04 3.708400e+04 3.708400e+04 3.708400e+04 3.708400e+04 3.708400e+04 3.708400e+04 3.708400e+04 3.708400e+04 ... 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000 37084.000000
mean 18542.500000 -1.231293e-16 -1.073840e-15 2.788444e-15 1.661555e-15 1.230906e-13 1.447244e-13 5.848812e-14 9.814922e-14 -8.836288e-14 ... 0.000539 0.003317 0.002103 0.001510 0.001915 0.002346 0.001160 0.863526 0.033411 0.113472
std 10705.373028 1.000013e+00 1.000013e+00 1.000013e+00 1.000013e+00 1.000013e+00 1.000013e+00 1.000013e+00 1.000013e+00 1.000013e+00 ... 0.023217 0.057497 0.045814 0.038831 0.043715 0.048380 0.034033 0.343296 0.179709 0.317173
min 1.000000 -2.265290e+00 -1.322932e+00 -8.221931e-01 -1.704507e-01 -3.489902e-01 -2.219709e+00 -2.377031e+00 -2.227359e+00 -1.723043e+00 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 9271.750000 -7.851027e-01 -7.489512e-01 -8.221931e-01 -1.704507e-01 -3.489902e-01 -1.199935e+00 -8.661874e-01 -4.748648e-01 -1.313573e+00 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
50% 18542.500000 -1.930276e-01 -3.100249e-01 -1.779476e-01 -1.704507e-01 -3.489902e-01 6.484045e-01 2.989255e-01 -2.801432e-01 7.124386e-01 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
75% 27813.250000 6.950849e-01 4.735709e-01 4.662980e-01 -1.704507e-01 -3.489902e-01 8.396120e-01 7.224457e-01 8.881865e-01 7.724174e-01 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000
max 37084.000000 2.915366e+00 2.307354e+00 2.399035e+00 1.830813e+01 1.377317e+01 8.396120e-01 2.058695e+00 2.943581e+00 8.208617e-01 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 631 columns

Step 4: Model Building & Performance Analysis¶

Separating Independent and Dependent Variables¶

In [125]:
X = df.drop(['customer_id','y_yes'],axis=1)
y = df[['y_yes']]

Train-Test Split (test_size = 0.2)¶

In [126]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2 , random_state=42,stratify=y)

Model 1: Grid Search on Random Forest Classifier¶

In [140]:
from sklearn.model_selection import GridSearchCV
In [161]:
param_grid={
'n_estimators':[50,75,100,150],
'max_depth':[8,10,12,15],
'min_samples_leaf':[10,20,30],
}
grid_search_rf = GridSearchCV(estimator=rf,param_grid=param_grid,cv=3)



grid_search_rf.fit(X_train,y_train)
Out[161]:
GridSearchCV(cv=3, estimator=RandomForestClassifier(max_depth=15),
             param_grid={'max_depth': [8, 10, 12, 15],
                         'min_samples_leaf': [10, 20, 30],
                         'n_estimators': [50, 75, 100, 150]})
In [162]:
print(grid_search_rf.best_params_)
{'max_depth': 15, 'min_samples_leaf': 10, 'n_estimators': 50}
In [171]:
from sklearn import metrics
y_train_predict = grid_search_rf.predict(X_train)
model_score =grid_search_rf.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, y_train_predict))
print(metrics.classification_report(y_train, y_train_predict))
0.902956146560151
[[26163   138]
 [ 2741   625]]
              precision    recall  f1-score   support

           0       0.91      0.99      0.95     26301
           1       0.82      0.19      0.30      3366

    accuracy                           0.90     29667
   macro avg       0.86      0.59      0.63     29667
weighted avg       0.90      0.90      0.87     29667

In [185]:
pred_prob = grid_search_rf.predict_proba(X_test)
grid_probs = pred_prob[:,1]
In [186]:
from sklearn.metrics import precision_recall_curve
precision, recall, threshold = precision_recall_curve(y_test,grid_probs)

f_score = (2*precision*recall)/(precision+recall)
ix = np.argmax(f_score)
print(f_score[ix])
print(threshold[ix])
0.5753768844221105
0.16508158723302946
In [187]:
y_test_predict= (grid_probs>=threshold[ix]).astype('int')
model_score = grid_search_rf.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, y_test_predict))
print(metrics.classification_report(y_test, y_test_predict))
0.9040043144128354
[[5716  859]
 [ 155  687]]
              precision    recall  f1-score   support

           0       0.97      0.87      0.92      6575
           1       0.44      0.82      0.58       842

    accuracy                           0.86      7417
   macro avg       0.71      0.84      0.75      7417
weighted avg       0.91      0.86      0.88      7417

Model 2: Random Forest Classifier¶

In [188]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
Out[188]:
RandomForestClassifier()
In [189]:
from sklearn import metrics
rf_train_predict = rf.predict(X_train)
model_score =rf.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, rf_train_predict))
print(metrics.classification_report(y_train, rf_train_predict))
1.0
[[26301     0]
 [    0  3366]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     26301
           1       1.00      1.00      1.00      3366

    accuracy                           1.00     29667
   macro avg       1.00      1.00      1.00     29667
weighted avg       1.00      1.00      1.00     29667

In [190]:
pred_prob = rf.predict_proba(X_test)
rf_probs = pred_prob[:,1]
In [191]:
from sklearn.metrics import precision_recall_curve
precision, recall, threshold = precision_recall_curve(y_test,rf_probs)



f_score = (2*precision*recall)/(precision+recall)
ix = np.argmax(f_score)
print(f_score[ix])
print(threshold[ix])
0.6184782608695651
0.33
In [192]:
rf_test_predict= (rf_probs>=threshold[ix]).astype('int')
model_score = rf.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, rf_test_predict))
print(metrics.classification_report(y_test, rf_test_predict))
0.9111500606714305
[[6146  429]
 [ 273  569]]
              precision    recall  f1-score   support

           0       0.96      0.93      0.95      6575
           1       0.57      0.68      0.62       842

    accuracy                           0.91      7417
   macro avg       0.76      0.81      0.78      7417
weighted avg       0.91      0.91      0.91      7417

In [235]:
importances = rf.feature_importances_
indices = np.argsort(importances)
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices))[-10:], importances[indices][-10:], color='b', align='center')
plt.yticks(range(len(indices))[-10:], X_train.columns[indices][-10:])
plt.xlabel('Relative Importance')
plt.show()
No description has been provided for this image

Model 3: Extreme Gradient Boosting Machine¶

In [193]:
# !pip install xgboost
In [194]:
import xgboost as xgb
xg_reg = xgb.XGBClassifier()
xg_reg.fit(X_train,y_train)
Out[194]:
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)
In [195]:
from sklearn import metrics
xg_train_predict = xg_reg.predict(X_train)
model_score =xg_reg.score(X_train, y_train)
print(model_score)
print(metrics.confusion_matrix(y_train, xg_train_predict))
print(metrics.classification_report(y_train, xg_train_predict))
0.949303940405164
[[25827   474]
 [ 1030  2336]]
              precision    recall  f1-score   support

           0       0.96      0.98      0.97     26301
           1       0.83      0.69      0.76      3366

    accuracy                           0.95     29667
   macro avg       0.90      0.84      0.86     29667
weighted avg       0.95      0.95      0.95     29667

In [196]:
pred_prob = xg_reg.predict_proba(X_test)
xg_probs = pred_prob[:,1]
In [197]:
from sklearn.metrics import precision_recall_curve
precision, recall, threshold = precision_recall_curve(y_test,xg_probs)

f_score = (2*precision*recall)/(precision+recall)
ix = np.argmax(f_score)
print(f_score[ix])
print(threshold[ix])
0.6584992343032159
0.31435093
In [198]:
xg_test_predict= (xg_probs>=threshold[ix]).astype('int')
model_score = xg_reg.score(X_test, y_test)
print(model_score)
print(metrics.confusion_matrix(y_test, xg_test_predict))
print(metrics.classification_report(y_test, xg_test_predict))
0.916003775111231
[[6103  472]
 [ 197  645]]
              precision    recall  f1-score   support

           0       0.97      0.93      0.95      6575
           1       0.58      0.77      0.66       842

    accuracy                           0.91      7417
   macro avg       0.77      0.85      0.80      7417
weighted avg       0.92      0.91      0.92      7417

In [237]:
importances = xg_reg.feature_importances_
indices = np.argsort(importances)
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices))[-10:], importances[indices][-10:], color='b', align='center')
plt.yticks(range(len(indices))[-10:], X_train.columns[indices][-10:])
plt.xlabel('Relative Importance')
plt.show()
No description has been provided for this image

Ensembling the 3 models for better generalization and performance improvement¶

In [253]:
en_probs = grid_probs*0.2 + rf_probs*0.2 + xg_probs*0.6
In [254]:
from sklearn.metrics import precision_recall_curve
precision, recall, threshold = precision_recall_curve(y_test,en_probs)

f_score = (2*precision*recall)/(precision+recall)
ix = np.argmax(f_score)
print(f_score[ix])
print(threshold[ix])
0.6554702495201535
0.2538157994347361
In [255]:
en_test_predict= (en_probs>=threshold[ix]).astype('int')
print(metrics.confusion_matrix(y_test, en_test_predict))
print(metrics.classification_report(y_test, en_test_predict))
[[6016  559]
 [ 159  683]]
              precision    recall  f1-score   support

           0       0.97      0.91      0.94      6575
           1       0.55      0.81      0.66       842

    accuracy                           0.90      7417
   macro avg       0.76      0.86      0.80      7417
weighted avg       0.93      0.90      0.91      7417

Step 5: Model Insights and Recommendations¶

Model Insights¶

  1. Three models have been built and analysed for the current scenario.
    • Grid Search CV on RandomForestClassifier
    • RandomForestClassifier
    • XGBoostClassifier
  2. SMOTE is not used as the original dataset has 90:10 type of class imbalance(too high) and synthetic data that will be generated will be bad.
  3. The purpose of using these three models is that GridSearchCV will help in hyperparameter tuning of the random forest classifier and will also include 3 fold cross-validation to ensure that the model generalizes well on unknown data. After this a typical RandomForest model with some hyperparameters(tuned using grid search but not shown here) are used. After this the XGBoost Classifier is used with default settings to as XGBoost can yield better results in case of classification due to boosting.
  4. From the feature importances of random forest classifier, we observe that the features of duration, social-economic indices, age, pdays, and previous campaign success are the most important and are given maximum importance while modelling.
  5. From the feature importances of xgboost classifier, we observe that the features of socio-economic indices, duration, success of previous campaign and contact mode as telephone are the most important and are given maximum importance while modelling.
  6. Here, recall score matters more than precision as we need to give more priority to incorrect False Negatives than to incorrect False Positives. This is because we want our model to predict most of the customers who actually turned up for the subscription from the campaign at the cost of even misclassifying some of the customers who actually didn't turn up for the subscription as positives. Hence, Recall is considered as the evaluation metric at the same time ensuring that the precision is not lost much through optimum f1_score.
  7. The results of the three models are then ensembled with appropriate weights to give us a model which combines the best world of all three models. Hence our final model has a recall score of 0.81 with f1_score of 0.66 and precision of of 0.55 for class 1 (see the classification report).

Recommendations and Suggestion for Improvment¶

  1. The Marketting team should invest their cost in calling the customers more than 3 times as majority (80%) of the term-deposit scheme subscribers have subscribed with 3 or less contacts.
  2. The cellular mode of contact is shown to be more effective than telephonic mode. This is because of the fact that majority of the customers prefer to use mobiles over telephones for communication.
  3. As shown in EDA, we could target our campiagns more towards students and retired individuals, as they have shown to have more proportion of subscribers for our term deposit than the rest of the job types. Also single people subsribe to our term deposit schemes more than married or divorced individuals as they are still in their phase of settling.
  4. We should also target those customers who haven't defaulted in the past as these seem to be interested in our subscribing our schemes more. This will also reduce the risk of default from our subscribers.
  5. We should also roll out most of the campaigns during the end of the year around september, october and december as thses months have shown high proportion of subsribers.
  6. EDA also shows us that the success of our previous campaign highly affects the conversion rates of our current campaign. Hence, the team should focus on investing more towards making every campaign a success as it would have some adstock in future campaigns too.
  7. The campaigns should also focus on selling their schemes to literate population rather than illeterate ones.
In [ ]: